pacman::p_load(knitr, rgdal, spdep, tmap, sf,
ggpubr, cluster, funModeling,
factoextra, NbClust, #factoextra factor analysis, access clustering results
heatmaply, corrplot, psych, tidyverse)Take Home Exercise 2 - Regionalization of Nigeria with Water points
Overview
Water is a crucial resource for humanity. People must have access to clean water in order to be healthy. It promotes a healthy environment, peace and security, and a sustainable economy. However, more than 40% of the world’s population lacks access to enough clean water. According to UN-Water, 1.8 billion people would live in places with a complete water shortage by 2025. One of the many areas that the water problem gravely threatens is food security. Agriculture uses over 70% of the freshwater that is present on Earth.
The severe water shortages and water quality issues are seen in underdeveloped countries. Up to 80% of infections in developing nations are attributed to inadequate water and sanitation infrastructure.
Despite technological advancement, providing rural people with clean water continues to be a key development concern in many countries around the world, especially in those on the continent of Africa.
The spatial patterns of non-functional water points will be shown in this study by using the proper global and local spatial association methodologies. We look at Nigeria’s in this assignment.
Data points of interest
In this assignment, we will attempt to regionalize Nigeria based on the following variables:
Total number of functional water points
Total number of nonfunctional water points
Percentage of functional water points
Percentage of non-functional water points
Percentage of main water point technology (i.e. Hand Pump)
Percentage of usage capacity (i.e. < 1000, >=1000)
Percentage of rural water points
Percentage of potable vs non potable water points
Percentage of water points accessible within median of primary road network
Percentage of water points accessible within median of secondary road network
Percentage of water points accessible within median of tertiary road network
Percentage of water points that are managed
Getting Started
First, we load the required packages in R
Spatial data handling
- sf, rgdal and spdep
Attribute data handling
- knitr, tidyverse, funModeling especially readr, ggplot2 and dplyr
Choropleth mapping
- tmap
Multivariate data visualization and analysis
- coorplot, ggpubr, and heatmaply
Cluster analysis
cluster
ClustGeo
Spatial Data
The spatial dataset used in this assignment is the Nigeria Level-2 Administrative Boundary spatial dataset downloaded from Geoboundaries
We will load the spatial features by using st_read() from the sf package
As the data we want is in WSG-84 format, we set crs to 4326.
We won’t utilize st_transform() at this time because it can result in outputs with missing points after transformation, which would skew our study.
nga = st_read(dsn = "data/geospatial",
layer = "geoBoundaries-NGA-ADM2",
crs = 4326)Reading layer `geoBoundaries-NGA-ADM2' from data source
`D:\Allanckw\ISSS624\Take-Home_Ex2\data\geospatial' using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS: WGS 84
We could use st_crs()to verify the coordinate system from the object.
st_crs(nga)Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Updating Spatial features that have identical name but are in different states
The following code determines whether any LGA names have been repeated. If the shapeName is not distinct for rows, duplicated() returns True. Only rows that satisfy the duplicated rows = True criterion are returned by subset() (Ong, 2022).
duplicate = nga$shapeName[nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)]]
duplicate [1] "Bassa" "Bassa" "Ifelodun" "Ifelodun" "Irepodun" "Irepodun"
[7] "Nasarawa" "Nasarawa" "Obi" "Obi" "Surulere" "Surulere"
From the result, there are 12 LGAs that have the same names, even though they are from different states. We will want to rename them or it would be confusing when we conduct further analysis.
Firstly, we create a data frame with the duplicate
nga_dupes = nga %>%
filter(shapeName %in% duplicate)Calling ttm() in the tmap package will switch the tmap’s viewing mode to interactive viewing, which will help us find the states of the respective duplicates in the map. We then plot the map with tmap functions
ttm()
tm_shape(nga_dupes) +
tm_polygons("shapeName") +
tm_borders(alpha=0.5) From the result, we can observe the following:
| shapeID | shapeName |
|---|---|
| NGA-ADM2-72505758B95534398 | Bassa (Kogi) |
| NGA-ADM2-72505758B52690633 | Bassa (Plateau) |
| NGA-ADM2-72505758B26581542 | Ifelodun (Kwara) |
| NGA-ADM2-72505758B18326272 | Ifelodun (Osun) |
| NGA-ADM2-72505758B75034141 | Irepodun (Kwara) |
| NGA-ADM2-72505758B79178637 | Irepodun (Osun) |
| NGA-ADM2-72505758B6786568 | Nasarawa (Kano) |
| NGA-ADM2-72505758B67188591 | Nasarawa (Nasarawa) |
| NGA-ADM2-72505758B7318634 | Obi (Benue) |
| NGA-ADM2-72505758B3073896 | Obi (Nasarawa) |
| NGA-ADM2-72505758B6675111 | Surulere (lagos) |
| NGA-ADM2-72505758B31597260 | Surulere (Oyo) |
Based on the results above, we replace the shape names with shapeID as the identifier and run the duplicate check again to verify that it is now empty
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B95534398'] = 'Bassa (Kogi)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B52690633'] = 'Bassa (Plateau)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B26581542'] = 'Ifelodun (Kwara)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B18326272'] = 'Ifelodun (Osun)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B75034141'] = 'Irepodun (Kwara)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B79178637'] = 'Irepodun (Osun)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B6786568'] = 'Nasarawa (Kano)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B67188591'] = 'Nasarawa (Nasarawa)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B7318634'] = 'Obi (Benue)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B3073896'] = 'Obi (Nasarawa)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B6675111'] = 'Surulere (lagos)'
nga$shapeName[nga$shapeID=='NGA-ADM2-72505758B31597260'] = 'Surulere (Oyo)'
duplicate = nga$shapeName[nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)]]
duplicatecharacter(0)
Aspatial Data
Cleaning the Data
The aspatial dataset used in this assignment is the water point data exchange dataset found in WPdx Global Data Repositories. Data is filtered on the web portal to only keep Nigeria and the file is saved as NigeriaWaterPoints_Raw.csv
As we are only interested in the functionality of the water point, it is important to capture fields that may may aid us in our analysis (Definition are found here: Source)
LGA: The area we are interested in
State: The state of the LGA of Nigeria
Functional: Whether it is functional or not
Management: who manages it?
Quality: what is the quality?
Water Source Category: where the water came from?
Water Tech Category: What technology is used?
is_urban: Is it in an urban area?
#distance_to_primary_road: Based on calculations with data from OpenStreetMap, the distance in km to the nearest road.
#distance_to_secondary_road: Based on calculations using OpenStreetMap data, the distance in km to the second-closest road.
#distance_to_tertiary_road: Using calculations with data from OpenStreetMap, the distance to the third-closest road is given in km.
Usage Capacity: Maximum recommended number of users per waterpoint
New Georeferenced Column: : Well Known Text (wkt) representing spatial data in a textual format
To load the raw data file, we use the read_csv function
wpdx_raw = read_csv("data/aspatial/NigeriaWaterPoints_Raw.csv") Most of the columns are irrelevant, so we will perform the following:
keep the columns we want to clean it up by specifying the columns with one to retain with
subsetrenaming the columns using
rename_withReplace all the NA with unknown for columns with NA value present
retain_cols = c('#clean_adm2', '#clean_adm1', '#status_clean', '#management_clean'
, '#subjective_quality', '#water_source_category',
'#water_tech_category',
'New Georeferenced Column', 'is_urban','#distance_to_primary_road',
'#distance_to_secondary_road', '#distance_to_tertiary_road'
,'usage_capacity')
new_col_names = c('LGA', 'State', 'Functional', 'Management',
'Quality', 'Water_Source_Category', 'Water_Tech_Category',
'WKT', 'Is_Urban_Area', 'dist_to_primary_road'
,'dist_to_secondary_road', 'dist_to_tertiary_road'
,'usage_capacity')
wpdx_clean = subset(wpdx_raw, select = (names(wpdx_raw) %in% retain_cols)) %>% rename_with(~ new_col_names, all_of(retain_cols)) %>%
replace_na(list(Functional = "Unknown", Management = "Unknown", Quality = "Unknown", Water_Source_Category = "Unknown", Water_Tech_Category = "Unknown"))We save the processed data frame wpdx_clean into a file, the file will be reduced to 3.4 MB from the 144 MB raw file that we downloaded.
saveRDS(wpdx_clean, "data/aspatial/wpdx_clean_ex2.rds")We can then delete the raw file from the project and retrieve the saved RDS file using readRDS()
wpdx_clean = readRDS("data/aspatial/wpdx_clean_ex2.rds")Converting csv data into spatial features
We can use st_as_sfc() to come up with the new field Geometry by using the WKT field
wpdx_clean$Geometry = st_as_sfc(wpdx_clean$`WKT`)We will then use st_sf() to convert the tibble data frame into sf data frame. The EPSG 4326 code is used as the dataset is referencing WGS84 geographic coordinate system.
We could use st_crs()to verify the coordinate system from the object.
wpdx_clean_sf = st_sf(wpdx_clean, crs=4326)
st_crs(wpdx_clean_sf)Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
We can then use glimpse() to verify each field’s data type & available values.
glimpse(wpdx_clean_sf)Rows: 95,008
Columns: 14
$ Water_Source_Category <chr> "Unknown", "Well", "Well", "Well", "Well", "Wel…
$ Water_Tech_Category <chr> "Tapstand", "Mechanized Pump", "Hand Pump", "Un…
$ State <chr> "Ekiti", "Ogun", "Ebonyi", "Enugu", "Enugu", "B…
$ LGA <chr> "Moba", "Obafemi-Owode", "Ohaukwu", "Isi-Uzo", …
$ Management <chr> "Unknown", "Other", "Unknown", "Unknown", "Unkn…
$ Functional <chr> "Unknown", "Functional", "Unknown", "Unknown", …
$ Quality <chr> "Unknown", "Acceptable quality", "Unknown", "Un…
$ dist_to_primary_road <dbl> 767.3742, 13364.9005, 9492.7619, 9319.0815, 543…
$ dist_to_secondary_road <dbl> 921.79187, 48.87743, 4333.34280, 23276.33227, 1…
$ dist_to_tertiary_road <dbl> 3146.733237, 4167.519068, 693.211204, 307.71619…
$ usage_capacity <dbl> 250, 1000, 300, 300, 300, 300, 300, 1000, 300, …
$ Is_Urban_Area <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ WKT <chr> "POINT (5.12 7.98)", "POINT (3.5976683 6.964531…
$ Geometry <POINT [°]> POINT (5.12 7.98), POINT (3.597668 6.9645…
Lastly, the attribute data from the nga sf data frame will be transferred into the wpdx_clean_sf data frame using a geoprocessing function known as point-in-polygon overlay. st_join() from the sf package can help us with that
nga_wp = st_join(wpdx_clean_sf, nga)Exploratory Data Analysis (EDA)
We can use freq() of the funModeling package to display the distribution of our data points of interest using wpdx_clean_sf. This is to help us aggregate the data as the dataset provide breakdowns of their respective catagories.
Categorical Variables
freq(data=nga_wp, input = 'Functional')
Functional frequency percentage cumulative_perc
1 Functional 45883 48.29 48.29
2 Non-Functional 29385 30.93 79.22
3 Unknown 10656 11.22 90.44
4 Functional but needs repair 4579 4.82 95.26
5 Non-Functional due to dry season 2403 2.53 97.79
6 Functional but not in use 1686 1.77 99.56
7 Abandoned/Decommissioned 234 0.25 99.81
8 Abandoned 175 0.18 99.99
9 Non functional due to dry season 7 0.01 100.00
Checking out functionality, we know that functional water points are broken down into Functional, Functional but needs repair, and Functional but not in use. 48.29% are functional, 4.82% are functional but needs repair and 1.77% of them are functional but not in use.
freq(data=nga_wp, input = 'Management')
Management frequency percentage cumulative_perc
1 Unknown 35917 37.80 37.80
2 Community Management 31655 33.32 71.12
3 Direct Government Operation 16871 17.76 88.88
4 Other 8361 8.80 97.68
5 School Management 1435 1.51 99.19
6 Health Care Facility 440 0.46 99.65
7 Private Operator/Delegated Management 194 0.20 99.85
8 Other Institutional Management 135 0.14 100.00
Checking out water point management, we know that water points are mostly managed except 37.8% which are unknowns
freq(data=nga_wp, input = 'Quality')
Quality frequency percentage cumulative_perc
1 Acceptable quality 71801 75.57 75.57
2 Unknown 10625 11.18 86.75
3 No because of Taste 6712 7.06 93.81
4 No because of Colour 3986 4.20 98.01
5 No because of Odour 1847 1.94 99.95
6 Within National limits (potable) 19 0.02 99.97
7 Within National standards (potable) 18 0.02 100.00
Checking out the quality of water points, we know that 75.57% of them are of acceptable quality, with an additional 0.04% of them within potable national limits / standards. The rest are either unknown or of unacceptable quality.
freq(data=nga_wp, input = 'Water_Source_Category')
Water_Source_Category frequency percentage cumulative_perc
1 Well 91540 96.35 96.35
2 Spring 3163 3.33 99.68
3 Unknown 302 0.32 100.00
4 Piped Water 3 0.00 100.00
Checking out the source of water points, we know that majority of them (96.35%) comes from well. With such a high number, this variable may not be useful for our analysis as it most of the data points will come have this value.
freq(data=nga_wp, input = 'Water_Tech_Category')
Water_Tech_Category frequency percentage cumulative_perc
1 Hand Pump 58755 61.84 61.84
2 Mechanized Pump 25644 26.99 88.83
3 Unknown 10055 10.58 99.41
4 Tapstand 553 0.58 99.99
5 Rope and Bucket 1 0.00 100.00
Checking out the technology the water point uses, we know that water points are broken down into Hand pumps, Mechanized Pump, Tapstand, Rope and bucket. 61.84% of water points operates on hand pumps, 26.99% on mechanized pumps, 10.58% are unknowns and a minority of them (less than 0.58%) are either on tapstand or Rope and Bucket.
freq(data=nga_wp, input = 'Is_Urban_Area')
Is_Urban_Area frequency percentage cumulative_perc
1 FALSE 75444 79.41 79.41
2 TRUE 19564 20.59 100.00
From the records, only 20.59% of the water points are in urban areas, while the rest of them (79.41%) are in the rural areas.
freq(data=nga_wp, input = 'usage_capacity')
usage_capacity frequency percentage cumulative_perc
1 300 68789 72.40 72.40
2 1000 25644 26.99 99.39
3 250 573 0.60 99.99
4 50 2 0.00 100.00
Majority of the water points caters to 300 people or less (73.1%), while 26.99% of the water points caters to a capacity of 1000 people
Continuous Variables
We will need to find out the summary of the respective distance in order to categorize them appropriately for analysis. We can achieve that by using summary statistics in R
summary(nga_wp$dist_to_primary_road) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01 1361.68 6647.50 10177.43 15132.29 82666.56
summary(nga_wp$dist_to_secondary_road) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 788.6 4446.0 7934.7 11083.7 94773.0
summary(nga_wp$dist_to_tertiary_road) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 261.5 1442.4 3853.2 5114.6 58874.0
Based on the results above, we know that the median distance (in km) to primary, secondary and tertiary roads is 6647.50 km, 4446.0 km and 1442.4 km respectively. We can use within median distance to analyze if accessibility to water points is a factor to regionalization
Aggregate the Data
To aggregate the variable of interest, we will create new data frames to store them by using the filter function. Variables names used are self explanatory.
func_list = c("Functional", "Functional but needs repair", "Functional but not in use")
wpt_functional_true = wpdx_clean_sf %>%
filter(Functional %in% func_list)
wpt_functional_false = wpdx_clean_sf %>%
filter(!Functional %in% c(func_list, "Unknown"))
wpt_rural = wpdx_clean_sf %>%
filter(Is_Urban_Area == FALSE)
wpt_handpumps_true = wpdx_clean_sf %>%
filter(Water_Tech_Category %in% "Hand Pump")
wpt_handpumps_false = wpdx_clean_sf %>%
filter(!Water_Tech_Category %in% "Hand Pump")
wpt_potable_true = wpdx_clean_sf %>%
filter(Quality %in% c("Acceptable quality", "Within National standards (potable)
", "Within National limits (potable)"))
wpt_potable_false = wpdx_clean_sf %>%
filter(!Quality %in% c("Acceptable quality", "Within National standards (potable)
", "Within National limits (potable)", "Unknown"))
wpt_potable_unknown = wpdx_clean_sf %>%
filter(!Quality %in% "Unknown")
wpt_usageOver1000 = wpdx_clean_sf %>%
filter(usage_capacity >= 1000)
wpt_usageUnder1000 = wpdx_clean_sf %>%
filter(usage_capacity < 1000)
wpt_accessibility_PriRd_lessThanMedian = wpdx_clean_sf %>%
filter(dist_to_primary_road < 6647.50)
wpt_accessibility_PriRd_moreThanMedian = wpdx_clean_sf %>%
filter(dist_to_primary_road >= 6647.50)
wpt_accessibility_secRd_lessThanMedian = wpdx_clean_sf %>%
filter(dist_to_secondary_road < 4446.0 )
wpt_accessibility_secRd_moreThanMedian = wpdx_clean_sf %>%
filter(dist_to_secondary_road >= 4446.0 )
wpt_accessibility_TerRd_lessThanMedian = wpdx_clean_sf %>%
filter(dist_to_tertiary_road < 1442.4 )
wpt_accessibility_TerRd_moreThanMedian = wpdx_clean_sf %>%
filter(dist_to_tertiary_road >= 1442.4 )
wpt_managed_true = wpdx_clean_sf %>%
filter(!Management %in% "Unknown")
wpt_managed_unknown = wpdx_clean_sf %>%
filter(Management %in% "Unknown")Computing our data points of interest
We can use st_intersects() to find common data points between geographical datasets. In our case we need to find the common points in the Nigeria’s LGA spatial dataset and the water point aspatial dataset
The below code adds new columns to compute the total number of water points of our data point of interests
As an example, the below code intersects the Nigeria LGA dataset (nga_wp dataframe) with the water point dataset (wpdx_clean_sf dataframe) and produce a new column to denote the total number of water points in the area (Total wpt) by using mutate() and lengths()
nga_wp = nga %>%
#combine nga with water point sf
mutate(`total wpt` = lengths(
st_intersects(nga, wpdx_clean_sf)))Similarly, for the rest of the variables, we piped the output and add the new columns to denote the number of
water points are functional, non functional and unknown
Water points that are in rural areas
Water points that either uses Hand pumps or otherwise
Water points that have either potable or non potable water
Water points with usage capacity of over 1000 persons or under
Water points within or over Median distance of Primary, Secondary and Tertiary roads
Water points management that are managed
nga_wp = nga %>%
#combine nga with water point sf
mutate(`total wpt` = lengths(
st_intersects(nga, wpdx_clean_sf))) %>%
#add columns to produce no. of functional, non functional and unknown points
mutate(`wpt functional` = lengths(
st_intersects(nga, wpt_functional_true))) %>%
mutate(`wpt non functional` = lengths(
st_intersects(nga, wpt_functional_false))) %>%
# rural
mutate(`isRural` = lengths(
st_intersects(nga, wpt_rural))) %>%
# hand pumps
mutate(`Uses Handpumps` = lengths(
st_intersects(nga, wpt_handpumps_true))) %>%
# Potable
mutate(`Potable` = lengths(
st_intersects(nga, wpt_potable_true))) %>%
mutate(`Non Potable` = lengths(
st_intersects(nga, wpt_potable_false))) %>%
# Usage Capacity
mutate(`usage Over 1000` = lengths(
st_intersects(nga, wpt_usageOver1000))) %>%
mutate(`usage Under 1000` = lengths(
st_intersects(nga, wpt_usageUnder1000))) %>%
#Primary Road
mutate(`Within Median Distance to Pri Road` = lengths(
st_intersects(nga, wpt_accessibility_PriRd_lessThanMedian))) %>%
mutate(`Exceed Median Distance to Pri Road` = lengths(
st_intersects(nga, wpt_accessibility_PriRd_moreThanMedian))) %>%
#Secondary Road
mutate(`Within Median Distance to Sec Road` = lengths(
st_intersects(nga, wpt_accessibility_secRd_lessThanMedian))) %>%
mutate(`Exceed Median Distance to Sec Road` = lengths(
st_intersects(nga, wpt_accessibility_secRd_moreThanMedian))) %>%
#Tertiary Road
mutate(`Within Median Distance to Ter Road` = lengths(
st_intersects(nga, wpt_accessibility_PriRd_lessThanMedian))) %>%
mutate(`Exceed Median Distance to Ter Road` = lengths(
st_intersects(nga, wpt_accessibility_TerRd_moreThanMedian))) %>%
#management
mutate(`Managed` = lengths(
st_intersects(nga, wpt_managed_true))) Once we are done with the raw data, we use the below code to compute the respective percentages
percentage of functional water points
percentage of non functional water points
percentage of water points with hand pumps as technology
percentage of water points with potable / non potable water
percentage of water points with capacity over/under 1000
percentage of water points that are managed
percentage of water points in rural areas
percentage of water points are within the median distance to primary road network
percentage of water points are within the median distance to secondary road network
percentage of water points are within the median distance to tertiary road network
nga_wp = nga_wp %>%
#add columns to compute %
mutate(`pct_functional` = `wpt functional`/`total wpt`) %>%
mutate(`pct_non-functional` = `wpt non functional`/`total wpt`) %>%
mutate(`pct_Handpump` = `Uses Handpumps`/`total wpt`) %>%
mutate(`pct_Potable` = `Potable`/`total wpt`) %>%
mutate(`pct_NonPotable` = `Non Potable`/`total wpt`) %>%
mutate(`pct_Cap_Over_1000` = `usage Over 1000`/`total wpt`) %>%
mutate(`pct_Cap_Under_1000` = `usage Under 1000`/`total wpt`) %>%
mutate(`pct_managed` = `Managed`/`total wpt`) %>%
mutate(`pct_rural` = `isRural`/`total wpt`) %>%
mutate(`pct_w_meddist_to_PriRoad` = `Within Median Distance to Pri Road`/`total wpt`) %>%
mutate(`pct_w_meddist_to_SecRoad` = `Within Median Distance to Sec Road`/`total wpt`) %>%
mutate(`pct_w_meddist_to_TerRoad` = `Within Median Distance to Ter Road`/`total wpt`) After we are done, we will inspect the data by viewing the data frame, we can observe that there are some rows that have NaN due to division by zero. Lets replace those values to 0 with the is.na() function and change the row names to the names of the LGAs with row.names() by using the code below
nga_wp[is.na(nga_wp)] = 0
row.names(nga_wp) = nga_wp$shapeNameWe will only keep the following variables as the rest are intermediate data points that are no longer relevant
wpt functional
wpt non functional
pct_functional
pct_non-functional
pct_Handpump
pct_Potable
pct_NonPotable
pct_Cap_Over_1000
pct_Cap_Under_1000
pct_managed
pct_rural
pct_w_meddist_to_PriRoad
pct_w_meddist_to_SecRoad
pct_w_meddist_to_TerRoad
nga_wp_interested_data_pts = nga_wp %>%
select(8:9, 23:34) We save the processed data frame nga_wp_interested_data_pts into a file with saveRDS(), so that we do not need to process it again
saveRDS(nga_wp_interested_data_pts, "data/geospatial/nga_wp_interested_data_pts.rds")We can then retrieve the saved RDS file using readRDS()
nga_wp_interested_data_pts = readRDS("data/geospatial/nga_wp_interested_data_pts.rds")EDA using statistical graphics
A Histogram is useful to identify the overall distribution of the data values (i.e. positively skew, negatively skew or normal distribution)
We use ggplot2’s histogram (geom_histogram) to plot the percentage functional and non functional water points with the mean and median as abline using geom_vline()
We plot them side by side using ggarrange
functional_pct_histo = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Functional", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$pct_functional)),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$pct_functional)),
color="brown", linetype="dashed", linewidth=1)
nonfunctional_pct_histo = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_non-functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Non Functional", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_non-functional` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_non-functional`)),
color="brown", linetype="dashed", linewidth=1)
ggarrange(functional_pct_histo, nonfunctional_pct_histo, ncol = 2, nrow = 1)
From the result, we can observe that both functional and non functional water points distributions are positively skewed. The functional water point mean and median is close to 50%, whereas the non functional water points median and mean are very close to each other at 35.05% and 35.92% respectively
We can also use box plot to detect outliers with geom_boxplot. We plot them side by side using ggarrange
functional_boxplot = ggplot(data=nga_wp_interested_data_pts,
aes(x=`pct_functional`)) +
labs(x = "% Functional") +
geom_boxplot(color="black", fill="light blue")
nonfunctional_boxplot = ggplot(data=nga_wp_interested_data_pts,
aes(x=`pct_non-functional`)) +
labs(x = "% Non Functional") +
geom_boxplot(color="black", fill="light blue")
ggarrange(functional_boxplot, nonfunctional_boxplot, ncol = 2, nrow = 1)
We can see that there isn’t any outliers in the functional water points, but for the non functional water points, there is an outlier where 100% of the water points are not functional.
In the figure below, multiple histograms are plotted to reveal the distribution of the selected variables in the ict_derived data.frame. First, We do this by creating all the histograms assigned to individual variables.
functional = ggplot(data=nga_wp_interested_data_pts,
aes(x= `wpt functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "No. of Functional", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`wpt functional`)),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`wpt functional`)), color="brown", linetype="dashed", linewidth=1)
nonfunctional = ggplot(data=nga_wp_interested_data_pts,
aes(x= `wpt non functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "No. of Non Functional", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`wpt non functional` )),color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`wpt non functional`)),color="brown", linetype="dashed", linewidth=1)
functional_pct = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Functional", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_functional` )),color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_functional`)), color="brown", linetype="dashed", linewidth=1)
nonfunctional_pct = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_non-functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Non Functional", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_non-functional` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_non-functional`)),
color="brown", linetype="dashed", linewidth=1)
pct_Handpump = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_Handpump`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Handpump", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Handpump` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Handpump`)),
color="brown", linetype="dashed", linewidth=1)
pct_Potable = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_Potable`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Potable", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Potable` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Potable`)),
color="brown", linetype="dashed", linewidth=1)
pct_NonPotable = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_NonPotable`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Non Potable", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_NonPotable` )), color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_NonPotable`)), color="brown", linetype="dashed", linewidth=1)
pct_Usage_Capacity_Over_1000 = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_Cap_Over_1000`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Usage capacity >= 1000", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Cap_Over_1000` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Cap_Over_1000`)),
color="brown", linetype="dashed", linewidth=1)
pct_Usage_Capacity_Under_1000 = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_Cap_Under_1000`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Usage capacity < 1000", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_Cap_Under_1000` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_Cap_Under_1000`)),
color="brown", linetype="dashed", linewidth=1)
pct_managed = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_managed`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Managed", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_managed` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_managed`)),
color="brown", linetype="dashed", linewidth=1)
pct_rural = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_rural`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Rural", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_rural` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_rural`)),
color="brown", linetype="dashed", linewidth=1)
pct_within_median_dist_to_pri_road = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_w_meddist_to_PriRoad`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Rural", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_w_meddist_to_PriRoad` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_w_meddist_to_PriRoad`)),
color="brown", linetype="dashed", linewidth=1)
pct_within_median_dist_to_sec_road = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_w_meddist_to_SecRoad`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Rural", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_w_meddist_to_SecRoad` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_w_meddist_to_SecRoad`)),
color="brown", linetype="dashed", linewidth=1)
pct_within_median_dist_to_ter_road = ggplot(data=nga_wp_interested_data_pts,
aes(x= `pct_w_meddist_to_TerRoad`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
labs(x = "% Rural", y = "Frequency") +
geom_vline(aes(xintercept = mean(nga_wp_interested_data_pts$`pct_w_meddist_to_TerRoad` )),
color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median(nga_wp_interested_data_pts$`pct_w_meddist_to_TerRoad`)),
color="brown", linetype="dashed", linewidth=1)
ggarrange(functional, nonfunctional,
functional_pct, nonfunctional_pct,
ncol = 2, nrow = 2)
ggarrange(pct_Potable, pct_NonPotable,
pct_Usage_Capacity_Over_1000, pct_Usage_Capacity_Under_1000,
pct_managed, pct_rural,
ncol = 2, nrow = 3)
ggarrange(pct_within_median_dist_to_pri_road,
pct_within_median_dist_to_sec_road,
pct_within_median_dist_to_ter_road,
pct_Handpump,
ncol = 2, nrow = 2)
From the charts, we can observe that
No. of functional water points & non functional water points are positively skewed
percentage of functional follows closely to a normal distribution while percentage of non functional is positively skewed
percentage of water points with potable water is negatively skewed while percentage of water points with non potable water is positively skewed
percentage of water points with usage capacity over 1000 is positively skewed while percentage of water points with usage capacity less than 1000 is negatively skewed
percentage of managed water points is negatively skewed
percentage of rural water points is negatively skewed
percentage of water points with primary road within median distance is negatively skewed
percentage of water points with secondary road within median distance is negatively skewed
percentage of water points with tertiary road within median distance is negatively skewed
percentage of water points with hand pumps is negatively skewed.
As functional and non functional water points are NOT of the same units as the other variables in percentage, we will need to normalize it later.
Visualizing the spatial distribution of water points
We will plot the map with the tmap package. In this exercise we will use the jenks style as it looks for clusters of related values and highlights the differences between categories. (Nowosad, 2019)
nga_wp_interested_data_pts.map.pct_func =
tm_shape(nga_wp_interested_data_pts) +
tm_fill("pct_functional",
palette ="PuRd", style="jenks", n = 5) +
tm_borders(alpha=0.5) +
tm_grid (alpha=0.2) +
tm_layout(main.title="% functional WP - 2 Layer map",
main.title.position="center",
main.title.size=0.8,
frame = TRUE)
nga_wp_interested_data_pts.map.pct_nonfunc =
tm_shape(nga_wp_interested_data_pts) +
tm_fill("pct_non-functional",
palette ="PuRd", style="jenks", n = 5) +
tm_borders(alpha=0.5) +
tm_grid (alpha=0.2) +
tm_layout(main.title="% non functional WP - 2 Layer map",
main.title.position="center",
main.title.size=0.8,
frame = TRUE)
tmap_arrange(nga_wp_interested_data_pts.map.pct_func, nga_wp_interested_data_pts.map.pct_nonfunc, asp=1, ncol=2)From the results, we can observe the following:
The north east area of Nigeria have little to no water point at all
Functional water points congregate in the northern side of the country
Non functional water points congregates around the south western coast of Nigera, facing gulf of Guinea & Bight of Benin.
Correlation Analysis
Data Standardization
Reference
Calkins K. G (2005) Applied Statistics - Lesson 5, Correlation Coefficients
https://www.andrews.edu/~calkins/math/edrm611/edrm05.htm#:~:text=Correlation%20coefficients%20whose%20magnitude%20are,can%20be%20considered%20highly%20correlated.
Nowosad J. (2019), Map coloring: the color scale styles available in the tmap package https://geocompr.github.io/post/2019/tmap-color-scales/
ONG, J (2022), Geospatial Analytics for Social Good - Understanding Nigeria Water functional and non-functional water point rate https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#visualising-of-distribution-using-ggplot
